Abstract




We present a compositional model for video event detection. A video is modeled using a collection of both global and segment-level features, and kernel functions are employed for similarity comparisons. The locations of salient, discriminative video segments are treated as a latent variable, allowing the model to explicitly ignore portions of the video that are unimportant for classification. A novel multiple kernel learning (MKL) latent support vector machine (SVM) is defined that is used to combine and re-weight multiple feature types in a principled fashion while simultaneously operating within the latent variable framework. The compositional nature of the proposed model allows it to respond directly to the challenges of temporal clutter and intra-class variation, which are prevalent in unconstrained internet videos. Experimental results on the TRECVID Multimedia Event Detection 2011 (MED11) dataset demonstrate the efficacy of the method.

1.  Introduction 


Figure 1: A test video can be described using pieces of similar training videos. Similarity might be defined from different perspectives. In this example, parts of the test video from the board trick event are similar to three different videos in terms of motion and sound (green), pure motion (purple) or motion and texture (yellow).


Multimedia event detection in unconstrained video collections is a challenging problem. Event categories are diverse and exhibit large intra-class variation. Additionally, videos may be composed of a small number of important segments, while the remaining portions of the video are ineffective for classification.

Consider the example video from the board trick category in Fig. 1. This video contains segments focusing on the snowboard and on the person jumping, is shot in an outdoor ski-resort scene, and has fast-paced theme music. Together, all of these pieces of evidence can lead an algorithm to declare that this video is from the relevant category.

Building a model that can correctly categorize this type of video is challenging. Arguably, such a model must reason about which temporal segments within the video contain relevant evidence. Additionally, grouping these segments into different mid-level categories, or “scene types”, may be beneficial. For the board trick event, a particular video may involve a surfboard, skateboard, or snowboard trick, but is unlikely to include all three. Grouping segments into their relevant scene types can improve recognition. Finally, the model must utilize a variety of different low-level features in order to make such a decision.

In  this  paper  we  present  a  novel,  compositional  model 
for  video  event  detection.  Our  model  uses  a  latent  variable 
framework  to  localize  the  discriminative  temporal  segments 
of  a  video.  These  temporal  segments  are  matched  to  training 
segments  of  the  same  scene  type  via  kernels  that  combine 
information  from  several  feature  modalities.  The  test  video 
is  explained  as  a  composition  of  related  training  videos. 

The main contribution of this paper is the theoretical development of a formulation and learning algorithm for this type of model. The proposed compositional method has two key novel aspects: (1) a weakly supervised method for localizing only the most salient evidence for classification in a video sequence, which does not require manual marking of the salient segments, since they are automatically extracted and labeled by scene type; and (2) a novel multiple kernel learning algorithm with structured latent variables that permits the principled combination of multiple different low-level features in a single integrated framework.

2.  Previous  Work 

Event detection in unconstrained internet videos is an active area of research. We consider the TRECVID MED11 dataset - a large, diverse, and challenging video collection. Among the top ranking methods on this dataset is the work of Natarajan et al. [7], which performs a principled combination of many low-level features using a global, video-level representation. It is arguable that engineering a combination of many complementary low-level features is necessary for excellent performance on this dataset, and the method we propose can be used with a multitude of features in this manner. Furthermore, our multiple kernel learning algorithm offers an extension that allows for such feature combination in conjunction with latent SVMs. With this novel approach, more detailed comparisons between latently selected video segments can be considered.

Other video classification work includes Niebles et al. [8], who developed a related model for human action recognition, but used a fixed, single temporal ordering of key poses around anchor points - which may break down in internet videos due to temporal clutter. Tang et al. [12] extended this line of work to consider temporal segmentation via a variant of an HMM. Cao et al. [1] considered a “scene aligned pooling” feature representation to capture the different scenes present in a single video. In contrast to the above, our method focuses on intra-class variation and temporal scatter of an event by using latent variables to compose a test video in a kernelized framework. In direct comparisons, we show empirically that our approach outperforms these previous methods.

The approach we take to modeling internet videos is weakly supervised - only a video-level category label is provided during training. Segments and their associated scene types that compose a video are learned in an unsupervised fashion. Izadinia and Shah [4] developed a similar method, but with manual annotations on the training data - extending the image-attribute method of Wang and Mori [17] to the video domain.

Technically, the proposed approach is most closely related to [18, 20, 3], but differentiates itself by presenting a novel multiple kernel learning approach that accommodates structured latent variables. In comparison, Wu and Jia [18] and Yang et al. [20] developed kernelized variants of the latent support vector machine [2, 21]. However, the algorithms for learning kernelized latent SVMs in these papers have two drawbacks: they are limited to cases where one can enumerate the set of latent variables and they are restricted to a single kernel or a set of summed kernels. Finally, Gu et al. [3] consider low-level concept detection (e.g., flag, car, building) using a bag-instance relationship, whereas ours examines high-level event recognition.

Kernelized classifiers often offer superior performance. A body of work has aimed at providing efficient training and evaluation with kernelized classifiers via algorithmic optimizations or additive linear approximations [15, 6, 10]. This line of work is promising, but has yet to be extended to latent variable models, as is done here.

3.  Compositional  Models  for  Video  Retrieval 

We are interested in the classification of high-level complex events in unconstrained internet videos. Two significant challenges in this domain are temporal clutter (i.e., the evidence of a complex event can occur in small, isolated video segments) and intra-class variation. In this paper, we target both the intra-class variation and temporal clutter challenges by leveraging a compositional model.

Early successes on the TRECVID MED11 dataset have often deferred to an approach where the outputs of an array of simple classifiers operating on a range of low-level features are combined [7]. These approaches have tended to employ simple bag of words (BoW) representations with kernelized SVM classifiers. In such systems, the standard kernelized SVM can be thought of as a form of intelligent template matching, whereby a test video is compared directly against the set of support vectors. Such approaches can perform effective matching on global video-level representations, but are not well-suited for segment-level analysis. By introducing latent variables in our proposed method, kernelized latent SVMs are constructed that select particularly salient video segments. Thus, this intelligent template matching can now be completed not only at the video level, but also at the segment level. This approach provides our compositional model with the additional flexibility to mix and match segments from the pool of training videos when evaluating a test video, directly addressing the challenges of clutter and intra-class variation.

Additionally, to attain state-of-the-art performance on TRECVID MED11, it appears that multiple feature types must be combined. We further extend our model to combine multiple kernel learning with the kernelized latent SVM framework, adding the ability to weight feature types based on their relative importance.

3.1.  Linear  Model 

To begin the exposition we describe the linear version of our model, which consists of two parts. The first part is a global model that captures the overall theme or "subcategory" of the video. It is assumed that each event category contains several subcategories (e.g., a wedding ceremony at a church, house, or park). Further, it is assumed that a particular video corresponds to only one subcategory. The second part of our formulation is a "scene type model" that represents an event by a set of segment-level features. This part of the model is included to identify and localize discriminative segments of interest in a video. The model is depicted graphically in Fig. 2.

Figure 2: Depiction of our proposed model (panels: Global Models, Scene Type Models). The global model captures the subcategories of an event, and the scene model represents the different scene types observed in the category. The presence of a subcategory or scene type is represented using binary variables ($b_c$, $z_s$). The temporal position of scene types in a video is denoted by $t_s$.

We consider eight second segments that correspond to scenes observed within the event category (e.g., for wedding ceremony videos, outdoor park scenes or people dancing, cutting a cake, or kissing). A weakly supervised setting is considered, meaning that we are only given a binary event label for each video that indicates the presence of a complex event in the sequence; the subcategory labels, scene type labels, and temporal locations of scene types are not provided. These are modeled as hidden variables and we employ a latent max-margin approach [2] to infer them during training.

Concretely, assume we are given a video sequence $x$, and want to classify it into an event category. The variables $C$ and $S$ denote the number of subcategories and scene types for an event, respectively. The presence of a subcategory $c \in \{1, 2, \ldots, C\}$ is defined using the binary variable $b_c$; similarly, the presence of a scene type $s \in \{1, 2, \ldots, S\}$ is denoted using the binary variable $z_s$.

We define $\phi_g(x)$, a global feature extracted from the whole sequence, and $\varphi_l(x, t)$, a segment-level feature extracted from a temporal window of fixed size centered at time $t$ in $x$. Multiple features are incorporated to improve accuracy: $G$ global and $L$ local (segment-level) features. Together, the linear version of our model is defined as:

$$f_w(x, b, h) = \sum_{c=1}^{C} \sum_{g=1}^{G} w_{cg}^T \phi_g(x)\, b_c + \sum_{s=1}^{S} \sum_{l=1}^{L} w_{sl}^T \varphi_l(x, t_s)\, z_s \qquad (1)$$

where $w_{cg}$ is the learned weight vector for the subcategory model on the global feature $\phi_g(\cdot)$, and $w_{sl}$ is the weight vector for the scene type model defined on the segment-level feature $\varphi_l(\cdot)$. Use of the same set of feature types in the global and segment-level scales can be achieved by setting $G = L$. However, more generally, our model supports the added flexibility of using different sets of features for the two parts. For notational compactness, we represent the pair $(t_s, z_s)$ using $h_s$ for $s \in \{1, 2, \ldots, S\}$, and group them in the vector $h = \{h_1, h_2, \ldots, h_S\}$. We similarly group the subcategory binary variables in $b = \{b_1, b_2, \ldots, b_C\}$.

Note that the model in Eq. 1 assumes the temporal location for the scene type is shared among all segment-level feature types - they are all extracted from the same temporal window in the sequence.

It is assumed that a sequence can belong to only one global subcategory, but multiple scene types might be observed in a sequence, corresponding to the various segments. Therefore, two hard constraints are imposed on the selecting binary variables: $\sum_{c=1}^{C} b_c = 1$, and $\sum_{s=1}^{S} z_s = K$, where $K$ is a constant parameter.
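To make Eq. 1 and its two hard constraints concrete, the following is a minimal sketch of how the linear score could be evaluated for one fixed latent configuration. The function names (`dot`, `linear_score`) and the nested-list data layout are illustrative assumptions, not part of the original formulation.

```python
def dot(u, v):
    """Plain dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def linear_score(w_global, w_scene, phi_global, phi_local, b, z, t, K):
    """Evaluate Eq. 1 for one video under a fixed latent configuration.

    Assumed (hypothetical) data layout:
      w_global[c][g]  : weight vector of subcategory c on global feature g
      w_scene[s][l]   : weight vector of scene type s on segment feature l
      phi_global[g]   : global feature g of the video
      phi_local[l][j] : segment-level feature l at temporal position j
      b, z            : binary selectors of length C and S
      t               : t[s] is the temporal position chosen for scene type s
    """
    # Hard constraints: sum_c b_c = 1 and sum_s z_s = K.
    if sum(b) != 1:
        raise ValueError("exactly one subcategory must be selected")
    if sum(z) != K:
        raise ValueError("exactly K scene types must be selected")

    score = 0.0
    # Global part: only the selected subcategory contributes.
    for c, b_c in enumerate(b):
        if b_c:
            score += sum(dot(w_global[c][g], phi_global[g])
                         for g in range(len(phi_global)))
    # Scene part: all segment features of scene s share the position t[s].
    for s, z_s in enumerate(z):
        if z_s:
            score += sum(dot(w_scene[s][l], phi_local[l][t[s]])
                         for l in range(len(phi_local)))
    return score
```

In this sketch the latent variables are passed in explicitly; in the model itself they are unobserved and must be inferred by maximizing the score.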

The subcategory variables, $b_c$, and scene model configurations, $h_s$, are latent variables, unobserved on both training and testing data. Next, we develop a novel multiple kernel learning approach for learning with these latent variables.

3.2.  Multiple  Kernel  Latent  SVM 

Latent SVMs have been successfully used in many computer vision tasks. They were originally proposed for linear models [21, 2], where the similarity of two samples is measured using a simple dot product. Recently, LSVMs were extended to kernelized versions [20, 18] resulting in significant boosts in recognition accuracy. However, both [20, 18] assumed simple models with few latent variables that could be enumerated during inference. In our proposed model, latent variables are defined in a structured framework such that enumeration is not tractable.

The use of multiple complementary features can lead to improved recognition accuracy. With multiple features, fusion is a challenge because the importance of feature types is variable. Multiple kernel learning is a standard approach to address this challenge. A linear MKL SVM framework (e.g., [16]) typically performs such fusion by linearly combining a set of kernels, $K = \sum_i d_i K_i$, which corresponds to re-scaling the feature maps of the kernels by $\sqrt{d_i}$.
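The equivalence between a weighted sum of kernels and re-scaled feature maps can be checked numerically for linear kernels. The sketch below is illustrative only; the helper names (`combine_kernels`, `linear_gram`, `rescaled_concat`) are assumptions introduced for this example.

```python
import math


def combine_kernels(kernels, d):
    """Return the Gram matrix K = sum_i d_i * K_i for non-negative weights d."""
    if any(d_i < 0 for d_i in d):
        raise ValueError("kernel weights must be non-negative")
    n = len(kernels[0])
    return [[sum(d_i * K_i[r][c] for d_i, K_i in zip(d, kernels))
             for c in range(n)] for r in range(n)]


def linear_gram(features):
    """Gram matrix of the linear kernel for a list of feature vectors."""
    return [[sum(a * b for a, b in zip(u, v)) for v in features]
            for u in features]


def rescaled_concat(feature_sets, d):
    """Concatenate per-type features after scaling type i by sqrt(d_i)."""
    n = len(feature_sets[0])
    return [[math.sqrt(d_i) * value
             for d_i, feats in zip(d, feature_sets)
             for value in feats[r]]
            for r in range(n)]
```

With linear base kernels, `combine_kernels([linear_gram(f) for f in feature_sets], d)` and `linear_gram(rescaled_concat(feature_sets, d))` should agree up to floating-point error, which is the re-scaling interpretation stated above.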

The linear model in Eq. 1 is also defined with respect to multiple features. We require a training framework that can accommodate both latent variables and feature re-scaling simultaneously. We propose a novel multiple kernel latent SVM framework that extends standard MKL and can be used to train models of the form proposed in this paper.

Consider a set $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ of training videos, where $x_i \in \mathcal{X}$ is the video and $y_i \in \{-1, 1\}$ its label. Our goal is to learn a scoring function $F : \mathcal{X} \to \mathbb{R}$ that can be used to classify a video. Similar to the standard latent SVM, the proposed multiple kernel latent SVM (MKL-KLSVM; we use this abbreviation to prevent confusion with Multiple Kernel Learning SVM, MKL SVM) operates upon a set of base feature maps, $\Psi_i(x, v)$, defined on a sample $x$ and its latent variables $v \in \mathcal{V}$, where $\mathcal{V}$ is the set of all possible latent variables. We define the scoring function $F(x) = \max_{v} \sum_i \sqrt{d_i}\, w_i^T \Psi_i(x, v)$, where $d_i$ is the normalizing factor for the base feature map. Training of the MKL-KLSVM is then formulated as:


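$$\min_{w,\, b,\, \xi \ge 0,\, d \ge 0} \; \frac{1}{2} \sum_i w_i^T w_i + \rho \sum_n \xi_n + \frac{\lambda}{2} \sum_i d_i^2 \qquad (2)$$

$$\text{s.t.} \quad y_n \Big( \max_{v \in \mathcal{V}_n} \sum_i \sqrt{d_i}\, w_i^T \Psi_i(x_n, v) + b \Big) \ge 1 - \xi_n, \quad \forall n,$$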

where $\lambda$ is a regularizer on the kernel weights, $d_i$, to prevent them from diverging to infinity, and $\rho$ is a trade-off parameter to penalize error on the training data. Note that our multiple kernel latent SVM framework becomes a standard latent SVM [2] if the kernel coefficients, $d_i$, are set to one and will become a standard MKL classifier if the hidden variables are observed.

The objective function in Eq. 2 is not convex; however, convexity is attained if the latent variables for positive samples are available (semi-convexity of latent SVM [2]) and if $\sqrt{d_i}\, w_i$ is taken as the new weight vector in place of $w_i$. Here we limit the possible latent variables of positive samples to a single configuration, $\mathcal{V}_n = \{v_n^*\}\ \forall n : y_n = 1$, but allow negative samples to consider all possible latent variables, $\mathcal{V}_n = \mathcal{V}\ \forall n : y_n = -1$. Given that the latent variable configuration has been specified, the max operator can be omitted from Eq. 2, yielding:


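$$\min_{w,\, b,\, \xi \ge 0,\, d \ge 0} \; \frac{1}{2} \sum_i \frac{w_i^T w_i}{d_i} + \rho \sum_n \xi_n + \frac{\lambda}{2} \sum_i d_i^2 \qquad (3)$$

$$\text{s.t.} \quad y_n \Big( \sum_i w_i^T \Psi_i(x_n, v) + b \Big) \ge 1 - \xi_n, \quad \forall n,\ \forall v \in \mathcal{V}_n$$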

The objective function in Eq. 3 addresses the problem of learning parameters of a structural SVM with multiple kernels. It has $N^- |\mathcal{V}| + N^+$ constraints, where $N^-$ and $N^+$ are the number of negative and positive samples, respectively. If the latent variables are structured, $|\mathcal{V}|$ will be exponential. The same problem of exponential constraints is confronted with linear latent SVMs as well. Yu and Joachims [21] use the cutting plane algorithm [13] to ameliorate this challenge by mining hard constraints and iteratively optimizing with and updating the current constraints.

We use the cutting plane algorithm to extract the set of most violated constraints for negative samples during training, while the latent variables of positive videos remain fixed. Here, $V_n$ denotes the set of currently active constraints (as opposed to $\mathcal{V}_n$, which represents all the constraints defined over all possible latent variables). The set of active constraints, $V_n$, contains just a single constraint per positive sample, but can have multiple constraints for negative samples, extracted using the cutting plane algorithm.

Given a current set of constraints, a method is required for optimizing Eq. 3. By forming the Lagrangian of Eq. 3 and minimizing the objective function with respect to $w_i$, $\xi$ and $b$, we obtain

$$w_i = d_i \sum_{n,\, v \in V_n} \alpha_{n,v}\, y_n\, \Psi_i(x_n, v) \qquad (4)$$

where $\alpha_{n,v}$ is the Lagrangian variable for the $n$th sample and the latent variables, $v$. Substituting $w_i$ in Eq. 3 yields


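$$\min_{d \ge 0} \max_{\alpha} \; L(\alpha, d) = \sum_{n,v} \alpha_{n,v} + \frac{\lambda}{2} \sum_i d_i^2 - \frac{1}{2} \sum_i d_i \sum_{n,v} \sum_{m,v'} \alpha_{n,v}\, \alpha_{m,v'}\, y_n y_m\, \Psi_i(x_n, v)^T \Psi_i(x_m, v') \qquad (5)$$

$$\text{s.t.} \quad 0 \le \alpha_{n,v}, \quad \sum_{v} \alpha_{n,v} \le \rho \ \ \forall n, \quad \sum_{n,v} y_n \alpha_{n,v} = 0,$$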

which is an instance of the saddle point problem. In Eq. 5, $\Psi_i(x_n, v)^T \Psi_i(x_m, v')$ can be replaced with a kernel $k_i(x_n, v, x_m, v')$ that measures the similarity of $x_n$ and $x_m$, given their latent configurations. If the kernel weights, $d$, are fixed in Eq. 5, the inner maximization will become the Quadratic Program (QP) of a kernelized structural SVM [13]. We solve the saddle point problem by iteratively updating $d$ and subsequently performing QP optimization for $\alpha$ with a fixed $d$. The kernel weights can be updated using a Newton descent step or the cutting plane approach [5]. Alternatively, the Lagrangian of Eq. 5 can be derived to form the dual problem, which is differentiable and can be optimized using the sequential minimal optimization (SMO) algorithm [11], similar to [16].

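Here, we elect to use the simple Newton descent approach. Given the optimum, $\alpha^*$, from iteration $\tau$, in iteration $\tau + 1$ an update is computed as $d^{\tau+1} = d^{\tau} - \mu H^{-1} \nabla L(\alpha^*, d^{\tau})$, where $\mu = 1$ is the step size. Additionally, $H = \lambda I$ is the Hessian matrix of $L(\alpha^*, d)$ ($I$ is the identity matrix), and $\nabla L_i(\alpha^*, d) = \lambda d_i^{\tau} - \frac{1}{2} \sum_{n,v} \sum_{m,v'} \alpha^*_{n,v}\, \alpha^*_{m,v'}\, y_n y_m\, \Psi_i(x_n, v)^T \Psi_i(x_m, v')$ is the derivative of $L$ with respect to $d_i$. If a Newton descent update results in a negative kernel weight, it is back projected using $d_i^{\tau+1} = 0$ if $d_i^{\tau+1} < 0$.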
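The kernel-weight update described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: `newton_update`, the flat list `alpha_y` (holding $\alpha^*_{n,v} y_n$ for each active constraint), and the precomputed per-kernel Gram matrices over active constraints are all hypothetical names and layouts.

```python
def newton_update(d, alpha_y, kernels, lam, mu=1.0):
    """One projected Newton step on the kernel weights d.

    d        : current kernel weights, one per base kernel
    alpha_y  : alpha*_{n,v} * y_n for each active constraint (flattened)
    kernels  : kernels[i][p][q] is base kernel i evaluated between
               active constraints p and q
    lam      : regularizer lambda on the kernel weights
    mu       : step size
    """
    m = len(alpha_y)
    new_d = []
    for d_i, K_i in zip(d, kernels):
        # Quadratic form sum_{p,q} (alpha y)_p (alpha y)_q K_i[p][q].
        quad = sum(alpha_y[p] * alpha_y[q] * K_i[p][q]
                   for p in range(m) for q in range(m))
        grad = lam * d_i - 0.5 * quad   # derivative of L w.r.t. d_i
        step = d_i - mu * grad / lam    # Hessian is lam * I, so H^{-1} = I / lam
        new_d.append(max(0.0, step))    # back-project negative weights to 0
    return new_d
```

In a full training loop, this step would alternate with solving the inner QP for $\alpha$ at fixed $d$ until the objective stops changing.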

After updating the kernel weights, the inner quadratic program in Eq. 5 is solved by assuming $d$ is fixed. We iterate between these two steps until the optimization converges and the objective function does not change. Given the final $\alpha^*$ and $d^*$ (which together represent $w$), we infer the latent variables on the positive examples using $v_n^* = \arg\max_{v} \sum_i w_i^T \Psi_i(x_n, v)$. It has been shown for standard linear latent SVMs that iteratively updating the latent variables of positive samples and learning the latent SVM model parameters will minimize the objective function to a local optimum [21, 2]. The same argument holds for multiple kernel latent SVM. Algorithm 1 provides a summary of our proposed training algorithm.

Algorithm 1: Training a multiple kernel latent SVM

    Input: $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$
    Output: $\alpha^*$, $d^*$
    Initialize $V_n = \{v_n^*\}\ \forall n : y_n = 1$, $V_n = \{\}\ \forall n : y_n = -1$
    repeat
        repeat
            Optimize Eq. 3 using iterative Newton descent and QP given the current active constraints $V_n$
            $\forall n : y_n = -1$, add the most violated constraint to $V_n$
        until no change in objective function of Eq. 3
        $\forall n : y_n = 1$, update $V_n = \arg\max_{v} \sum_i w_i^T \Psi_i(x_n, v)$
    until no change in $V_n\ \forall n : y_n = 1$

3.3. Kernelized Model

We use multiple kernel latent SVM to train the parameters of our model defined in Eq. 1. However, we still must define $\Psi_i(x, v)$, the base features, and their corresponding kernels that have an associated re-scaling coefficient $d_i$ as in Eq. 2. For the linear model defined in Eq. 1, global models were defined on $G$ global features while scene type models employed $L$ segment-level feature types. Specifically, the base feature terms in Eq. 2, $w_i^T \Psi_i(x, v)$, are defined as $\sum_{c=1}^{C} w_{cg}^T \phi_g(x)\, b_c$ for the global features and $\sum_{s=1}^{S} w_{sl}^T \varphi_l(x, t_s)\, z_s$ for the segment-level features, which are derived from Eq. 1. Thus, $G + L$ kernels are defined as

$$K_g(x, b, x', b') = \sum_{c=1}^{C} b_c\, k_g(x, x')\, b'_c,$$

$$K_l(x, h, x', h') = \sum_{s=1}^{S} z_s\, k_l(x, t_s, x', t'_s)\, z'_s. \qquad (6)$$

Given two videos, $x$ and $x'$, $K_g$ measures the kernelized similarity of their global feature $g$ if they belong to the same subcategory; otherwise, it assigns zero similarity. Analogously, $K_l$ measures the kernelized similarity of segment-level feature $l$ for sequences $x$ and $x'$ at times $t_s$ and $t'_s$ for the scene models that are present in both $x$ and $x'$.
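The two kernels of Eq. 6 can be sketched as follows. The histogram intersection kernel is used as the base kernel because it is the one employed in the experiments of Section 4.1; the function names and list-based inputs are assumptions made for illustration.

```python
def hik(u, v):
    """Histogram intersection kernel between two histograms."""
    return sum(min(a, b) for a, b in zip(u, v))


def global_kernel(feat_x, b_x, feat_y, b_y, base=hik):
    """K_g: base-kernel similarity if both videos select the same
    subcategory, and zero otherwise (b_x, b_y are one-hot selectors)."""
    same = sum(bc * bc2 for bc, bc2 in zip(b_x, b_y))
    return same * base(feat_x, feat_y)


def segment_kernel(segs_x, z_x, t_x, segs_y, z_y, t_y, base=hik):
    """K_l: sum of base-kernel similarities over scene types that are
    present in both videos, each evaluated at its selected segment.

    segs_x[j] is the segment-level feature of video x at position j;
    t_x[s] is the position selected for scene type s.
    """
    total = 0.0
    for s in range(len(z_x)):
        if z_x[s] and z_y[s]:
            total += base(segs_x[t_x[s]], segs_y[t_y[s]])
    return total
```

Because the selectors are binary, each kernel reduces to a small number of base-kernel evaluations: at most one for $K_g$ and at most $K$ for $K_l$.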

Given the kernels defined in Eq. 6, Alg. 1 is used to learn $\alpha^*$ and $d^*$, the parameters of the proposed kernelized model. We can substitute these parameters in Eq. 1 to rewrite our scoring function for the kernelized model:

$$F(x) = \max_{h,\, b} \Big[ \sum_{n,\,(h_n, b_n)} \sum_{g=1}^{G} \alpha^*_{n,(h_n,b_n)}\, y_n\, d^*_g\, K_g(x_n, b_n, x, b) + \sum_{n,\,(h_n, b_n)} \sum_{l=1}^{L} \alpha^*_{n,(h_n,b_n)}\, y_n\, d^*_l\, K_l(x_n, h_n, x, h) \Big], \qquad (7)$$

where $(h_n, b_n) \in V_n$ are latent variables defined for the $n$th training sample.

The completed model in Eq. 7 is the full, proposed compositional model. Given the sequence, $x$, maximization matches the sequence to the training videos by choosing segment locations, $h$, and the subcategory model, $b$, that are well-explained by the training videos. A test video, $x$, is assigned a high score for an event category if it is similar to its associated positive training videos using two criteria. First, the global features from the test video should be similar to the global features from training videos. Second, the test video should contain segments that are similar to those in the training set. Under this framework, the test video can be composed using components from numerous training videos at both the global and segment scale. The learned kernel coefficients, $d$, allow for the re-scaling of the similarity measures on different parts of the model. This rescaling can give higher weights to important feature types while allowing for the extraction of the most discriminative evidence from the training set, using $(h_n, b_n)$.

3.4.  Implementation  Details 

Simple heuristics are used to initialize the latent variables for the positive samples. For the subcategory labels, we cluster the concatenated global features of the positive videos into $C$ clusters. Subsequently, we assign a video to the closest cluster. For the scene models, we similarly cluster the concatenated segment-level features of all segments from the positive training videos. Then, we choose the $K$ closest clusters to the video segments, and set the temporal location of each, $t_s$, to the closest segment.

Inference: For inferring latent variables, we first need to compute the global and scene model scores for each subcategory and scene type. For a general kernel type, there is no explicit form of $w_i$ and direct comparison to support vectors is necessary to compute the scores. Kernel comparison can significantly slow down the inference. Given $N_s$ support vectors, considering Eq. 7, Eq. 6 and the sparsity of $b_n$ and $z$ in $h_n$, $O(N_s G + N_s K L T)$ kernel comparisons will be required to compute the scores for a sequence. However, with additive kernels we can approximate the embedding feature [14], and form an approximated $w_i$ using Eq. 4. Thus, the number of linear kernel computations becomes $O(CG + SLT)$.

Consider the model in Fig. 1. Now, given global and scene type model scores, we need to infer the subcategory variables $b_c$ and temporal locations $t_s$ of the $K$ best scene type models. The subcategory can be found in $O(C)$. For a video with $T$ segments, the best location for each scene type is found in $O(T)$, and then the $K$ best scenes are selected in $O(S \log K)$ using a min heap. So, the complexity of inference is $O(C + ST + S \log K)$ in addition to the score computation. In our experiments, this inference takes 0.05 seconds for a 120-second video on an Intel CPU E7450 @ 2.40GHz.
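The inference step described above can be sketched as follows, assuming the per-subcategory and per-scene scores have already been computed and summed over feature types. The function name `infer_latent` and the input layout are hypothetical; `heapq.nlargest` is used here as a stand-in for the min-heap selection mentioned in the text.

```python
import heapq


def infer_latent(global_scores, scene_scores, K):
    """Select the best subcategory and the K best scene types.

    global_scores[c]   : score of subcategory c (summed over global features)
    scene_scores[s][j] : score of scene type s at segment j
                         (summed over segment-level features)
    Returns (total score, chosen subcategory, {scene type: segment index}).
    """
    # Best subcategory: a single pass over C scores.
    c_best = max(range(len(global_scores)), key=global_scores.__getitem__)

    # Best temporal location per scene type: O(T) for each of S scene types.
    best_per_scene = []
    for s, row in enumerate(scene_scores):
        j_best = max(range(len(row)), key=row.__getitem__)
        best_per_scene.append((row[j_best], s, j_best))

    # K best scene types via a bounded heap: O(S log K).
    top = heapq.nlargest(K, best_per_scene)

    total = global_scores[c_best] + sum(score for score, _, _ in top)
    selected = {s: j for _, s, j in top}
    return total, c_best, selected
```

Note that exactly $K$ scene types are selected even when some of their scores are negative, which mirrors the hard constraint $\sum_s z_s = K$.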

4.  Experiments 

We evaluate our model on the challenging TRECVID MED11 dataset [9], following a standard evaluation protocol used in previous work [12]. The TRECVID MED11 dataset contains 15 events that are divided across two collections, DEV-T and DEV-O. The DEV-T dataset consists of 10,723 videos including videos from five event categories: board trick (E1), feeding animal (E2), landing fish (E3), wedding ceremony (E4), and woodworking project (E5). The DEV-O collection is significantly larger, 32,061 videos, and includes ten categories: birthday party (E6), changing a tire (E7), flash mob (E8), getting a vehicle unstuck (E9), grooming animal (E10), making sandwich (E11), parade (E12), parkour (E13), repairing appliance (E14), and sewing project (E15). Both DEV-T and DEV-O are dominated by videos of the null category (i.e., background videos that do not contain the events of interest). For training, an Event-Kit data collection, containing roughly 150 positive videos per category, is also provided. A classifier is trained for each event category versus all other categories, similar to [12].

Table 1: Performance variation on the DEV-T dataset as a function of model parameters: the number of subcategories (C), number of scene types (S), and number of selected scenes (K). Selection is done for each parameter in turn and is fixed for subsequent parameters, as shown in red.

Model Settings             E1     E2     E3     E4     E5     mAP
C = 1,  S = 0             14.2    3.8   16.7   34.4    8.4   15.5
C = 2,  S = 0             14.1    3.9   17.6   35.8    8.5   16.0
C = 4,  S = 0             14.3    3.7   16.8   34.3   13.7   16.6
C = 8,  S = 0             13.8    3.8   18.3   40.7   16.6   18.6
C = 16, S = 0             12.1    3.9   17.3   38.8   15.1   17.4
C = 8,  S = K = 4         12.3    2.8   24.0   44.4   13.3   19.4
C = 8,  S = K = 8         11.1    2.6   25.3   44.6   12.8   19.2
C = 8,  S = K = 16        13.3    2.3   26.8   43.9   14.8   20.2
C = 8,  S = K = 32        13.1    2.1   27.2   44.6   14.3   20.2
C = 8,  S = 16, K = 1     15.3    3.3   20.1   42.3   16.6   19.5
C = 8,  S = 16, K = 2     14.8    3.4   24.1   46.1   18.4   21.4
C = 8,  S = 16, K = 4     17.4    3.2   26.3   46.3   17.5   22.1
C = 8,  S = 16, K = 8     12.8    2.9   29.0   48.5   17.9   22.2
C = 8,  S = 16, K = 16    13.3    2.3   26.8   43.9   14.8   20.2

For TRECVID MED11, DEV-T is used for development, whereas DEV-O is utilized for testing. Thus, we performed cross validation of all system parameters and hyperparameters on DEV-T and held them constant when considering DEV-O. We use mean average precision (mAP) as the performance metric to remain comparable with recently published works [1, 12].
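For reference, a common way to compute average precision and mAP from ranked detection scores is sketched below. The text does not state which AP variant is used in the evaluation, so the non-interpolated definition here is an assumption, as are the function names.

```python
def average_precision(scores, labels):
    """Non-interpolated average precision for one event category.

    scores : detection score per video (higher means more confident)
    labels : 1 for positive videos, any other value for negatives
    Ties in score are broken by input order (Python's sort is stable).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / hits if hits else 0.0


def mean_average_precision(per_event):
    """Mean of per-event APs; per_event is a list of (scores, labels) pairs."""
    aps = [average_precision(s, l) for s, l in per_event]
    return sum(aps) / len(aps)
```

Under this definition, each event's one-versus-all classifier contributes one AP value, and mAP is their unweighted mean, matching the way the mAP column of Table 1 averages the per-event columns.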

4.1.  Comparisons  using  HOG3D  Features 

First, we evaluated our proposed method against several baselines.
This evaluation uses HOG3D features, k-means quantized into a
1,000-word codebook for all methods. For this experiment, we use the
following set of baselines: Linear-SVM, a linear SVM using HOG3D BoW
features; KSVM, the same video-level features with a histogram
intersection kernel (HIK) SVM; Niebles [8]; Tang [12]; Linear-SAP, the
scene-aligned pooling method [1] using a linear SVM; and K-SAP, the
same method using a HIK-SVM. Results for Niebles and Tang are
reproduced from [12], and we obtained exactly the same quantized
features to be directly comparable. Also, note that we re-implemented
the scene-aligned pooling method [1] using parameters suggested by
the authors to permit direct comparisons.
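
The video-level representation shared by these baselines can be
sketched as follows: local descriptors are assigned to their nearest
codeword, the counts form a histogram, and two videos are compared with
the histogram intersection kernel. This is a toy illustration in which
a three-word codebook and 2-D descriptors stand in for the 1,000-word
HOG3D codebook; the L1 normalization and all names are assumptions, not
details taken from the paper.

```python
def quantize(descriptor, codebook):
    """Index of the nearest codeword under squared Euclidean distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(descriptor, codebook[i]))


def bow_histogram(descriptors, codebook):
    """L1-normalized bag-of-words histogram for one video or segment."""
    counts = [0.0] * len(codebook)
    for d in descriptors:
        counts[quantize(d, codebook)] += 1.0
    total = sum(counts)
    return [c / total for c in counts] if total else counts


def hik(h1, h2):
    """Histogram intersection kernel: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical codebook; in practice it would come from k-means.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
video_a = bow_histogram([(0.1, 0.0), (0.9, 0.1), (1.1, 0.0), (0.0, 0.8)], codebook)
video_b = bow_histogram([(0.0, 0.1), (0.1, 0.9), (0.0, 1.2), (0.2, 1.0)], codebook)
similarity = hik(video_a, video_b)
```

A kernelized SVM would consume the matrix of such pairwise similarities
in place of the raw histograms, which is the only difference between
the Linear-SVM and KSVM baselines.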

Two variants of our proposed model were considered: Linear-LSVM, using
a linear latent SVM, and KLSVM, using a HIK latent SVM. For the
proposed models, selection of appropriate parameters is required,
including the number of subcategories (C), number of scene types (S),
and number of selected scenes (K). We used the kernelized version of
our model with the HIK to choose the best parameters on DEV-T (E1 to
E5) and fixed them for all subsequent experiments using our model in
this paper. Parameters were selected based on the criteria of mAP
performance and model complexity. As Table 1 shows, mAP improves as the
various components of our model are added. In particular, our latent
model with the selected parameters (C = 8, S = 16, K = 4) outperforms
the standard kernelized SVM (C = 1, S = 0) by 6.6% in mAP.
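
The exact rule used to trade mAP against model complexity is not
stated. One plausible reading, which is purely an assumption of this
sketch, is to take the smallest K whose mAP lies within a small
tolerance of the best value. The mAP figures below are the C = 8,
S = 16 rows of Table 1; the function name and the tolerance are
hypothetical.

```python
# mAP (%) on DEV-T for C = 8, S = 16, copied from the Table 1 rows.
map_by_k = {1: 19.5, 2: 21.4, 4: 22.1, 8: 22.2, 16: 20.2}


def pick_k(map_by_k, tolerance=0.2):
    """Smallest K whose mAP is within `tolerance` points of the best mAP.

    Hypothetical rule: the paper does not say how accuracy and
    complexity were weighed against each other.
    """
    best = max(map_by_k.values())
    eligible = [k for k, m in map_by_k.items() if m >= best - tolerance]
    return min(eligible)


chosen_k = pick_k(map_by_k)
```

Under this rule K = 4 is preferred over K = 8, since the extra four
selected scenes buy only 0.1 points of mAP; with a tolerance of zero
the rule falls back to the best-scoring setting, K = 8.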

In this section, our novel multiple kernel learning formulation is not
employed, since the number of kernels used is very small. Section 4.2
considers experiments with the full model, using MKL for multi-feature
fusion.

Results for the six baselines and two variants of the proposed method
on DEV-O are shown in Table 2. When considering only models that
employ linear SVMs (i.e., Linear-SVM, Niebles, Tang, Linear-SAP, and
Linear-LSVM), the recently proposed scene-aligned pooling method
provides the highest performance, with a mean AP of 6.28%. The linear
variant of the proposed model offers mid-range performance. However,
the simple KSVM baseline outperforms all variants that use a linear
SVM classifier in terms of mAP, including Niebles and Tang, which model
complex structure. It appears that use of a kernelized SVM is critical
for the task of accurate event detection.

A second performance trend can be identified by considering the models
that use kernelized SVMs (i.e., KSVM, K-SAP, and KLSVM). Specifically,
the proposed model, KLSVM, outperforms all baselines in mAP, including
K-SAP by 3.72% and KSVM by 4.22%. Further, KLSVM attains the best
performance on eight out of ten event categories, often by a
significant margin (e.g., an 11.43% gap for E14). These results
emphasize the importance of using a compositional framework. Note that
a kernelized version of Tang was not considered because it is not clear
how the computationally expensive inference could be done for an
extension to kernel SVMs, especially for a large data collection.
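
The margins quoted above follow directly from Table 2. The short check
below recomputes them; the per-event values are copied from the table,
while the variable names are illustrative.

```python
# Per-event AP (%) on DEV-O, copied from Table 2 (Chance column omitted).
events = ["E6", "E7", "E8", "E9", "E10", "E11", "E12", "E13", "E14", "E15"]
ap = {
    "Linear-SVM":  [1.97, 1.25, 6.48, 2.15, 0.81, 1.10, 5.83, 2.58, 1.18, 0.92],
    "Niebles":     [2.25, 0.76, 8.30, 1.95, 0.74, 1.48, 2.65, 2.05, 4.39, 0.61],
    "Tang":        [4.38, 0.92, 15.29, 2.04, 0.74, 0.84, 4.03, 3.04, 10.88, 5.48],
    "Linear-SAP":  [2.77, 2.11, 25.48, 4.14, 1.03, 1.93, 7.06, 10.38, 6.69, 1.21],
    "Linear-LSVM": [2.34, 1.33, 10.30, 1.79, 0.76, 1.41, 5.71, 2.57, 4.58, 1.09],
    "KSVM":        [6.08, 2.87, 20.75, 6.25, 1.43, 2.29, 8.44, 9.44, 10.00, 2.49],
    "K-SAP":       [4.73, 2.26, 22.99, 7.61, 1.34, 2.65, 8.70, 10.43, 11.89, 2.40],
    "KLSVM":       [5.73, 4.81, 35.82, 8.38, 2.12, 4.65, 10.99, 13.11, 23.32, 3.29],
}

# Mean AP per method and the gaps between KLSVM and the kernel baselines.
mean_ap = {m: sum(v) / len(v) for m, v in ap.items()}
gap_ksap = mean_ap["KLSVM"] - mean_ap["K-SAP"]
gap_ksvm = mean_ap["KLSVM"] - mean_ap["KSVM"]

# Number of event categories on which KLSVM has the highest AP.
wins = sum(
    1 for i in range(len(events))
    if max(ap, key=lambda m: ap[m][i]) == "KLSVM"
)

# Margin over the best competing method on E14.
e14 = events.index("E14")
runner_up = max(v[e14] for m, v in ap.items() if m != "KLSVM")
e14_gap = ap["KLSVM"][e14] - runner_up

print(round(gap_ksap, 2), round(gap_ksvm, 2), wins, round(e14_gap, 2))
```

The two categories on which KLSVM is not the best method are E6, where
KSVM is ahead, and E15, where Tang is ahead.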

4.2.  Comparisons  using  Multiple  Features 

In  this  section,  we  demonstrate  the  effectiveness  of  the 
full,  multiple  kernel  learning-based  model  by  extending 
from  a  single  feature  modality  to  six  features. 

To demonstrate the full MKL-KLSVM model, HOG3D was supplemented with
five additional features from the Sun09 set [19]. The additional
features were: sparse SIFT, dense SIFT, HOG2x2, self-similarity
descriptors (SSIM), and color histograms. Here, the same set of
features was used for both the global and scene type parts of our model
(i.e., G = 1/ = 6). These particular features were selected because we
empirically found them to offer the best performance on TRECVID MED11.
Features were extracted at four-second time increments, synchronized with
Table 2: Performance comparison against several baselines using HOG3D
features on DEV-O for E6-E15. Numbers denote the average precision,
in %. Best results for a particular event category are marked with an
asterisk.

Event  Chance  Linear-SVM  Niebles [8]  Tang [12]  Linear-SAP [1]  Linear-LSVM  KSVM   K-SAP [1]  KLSVM
E6     0.54    1.97        2.25         4.38       2.77            2.34         6.08*  4.73       5.73
E7     0.35    1.25        0.76         0.92       2.11            1.33         2.87   2.26       4.81*
E8     0.42    6.48        8.30         15.29      25.48           10.30        20.75  22.99      35.82*
E9     0.26    2.15        1.95         2.04       4.14            1.79         6.25   7.61       8.38*
E10    0.25    0.81        0.74         0.74       1.03            0.76         1.43   1.34       2.12*
E11    0.43    1.10        1.48         0.84       1.93            1.41         2.29   2.65       4.65*
E12    0.58    5.83        2.65         4.03       7.06            5.71         8.44   8.70       10.99*
E13    0.32    2.58        2.05         3.04       10.38           2.57         9.44   10.43      13.11*
E14    0.27    1.18        4.39         10.88      6.69            4.58         10.00  11.89      23.32*
E15    0.26    0.92        0.61         5.48*      1.21            1.09         2.49   2.4        3.29
mAP    0.37    2.43        2.52         4.77       6.28            3.19         7.00   7.50       11.22*

the HOG3D features. The two coarser scales of a three-level spatial
pyramid were retained for dense SIFT, HOG2x2, and SSIM. Sparse SIFT and
color histograms were extracted on the whole frame. Global and
segment-level features are formed by averaging the histograms.
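
The averaging step can be sketched as follows, assuming one histogram
per extraction window in temporal order. Grouping consecutive windows
into non-overlapping segments, and the `windows_per_segment` parameter
itself, are assumptions of this sketch; the text only states that
histograms are averaged.

```python
def mean_histogram(histograms):
    """Bin-wise average of equally sized histograms."""
    n = len(histograms)
    return [sum(bins) / n for bins in zip(*histograms)]


def global_and_segment_features(window_hists, windows_per_segment=2):
    """Return (global feature, list of segment-level features).

    The global feature averages every window of the video; each
    segment-level feature averages only the windows inside that segment.
    """
    global_feat = mean_histogram(window_hists)
    segments = [
        mean_histogram(window_hists[i:i + windows_per_segment])
        for i in range(0, len(window_hists), windows_per_segment)
    ]
    return global_feat, segments


# Toy video with four extraction windows and a two-bin histogram each.
windows = [[4.0, 0.0], [2.0, 2.0], [0.0, 4.0], [0.0, 2.0]]
global_feat, segment_feats = global_and_segment_features(windows)
```

The global feature summarizes the whole video, whereas the
segment-level features are the candidates among which the latent
variable selects.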

Three baselines are compared against the full MKL-KLSVM, with all
systems using the identical set of six features. The first baseline,
KSVM, is trained on a summation of six kernels on the global features.
The second baseline, MKL-SVM, is similar to KSVM, but the weights on
the kernels are trained. KLSVM and MKL-KLSVM are variants of our model
that consider both the global and segment-level features. Global
models and scene type models are formed using and HIK, respectively.
In the KLSVM, the weights of all kernels are fixed to one, while in the
MKL-KLSVM, the kernel weights are learned.
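
The difference between the fixed-weight and MKL variants lies in how
the per-feature kernels are combined. The sketch below contrasts a
plain kernel sum with a weighted sum. It is an illustration only: the
HIK is used for every feature, the two feature names and the weights
are made up, and the weights are supplied by hand rather than learned.

```python
def hik(h1, h2):
    """Histogram intersection kernel: sum of bin-wise minima."""
    return sum(min(a, b) for a, b in zip(h1, h2))


def combined_kernel(x, y, weights=None):
    """Weighted sum of per-feature kernels between two videos.

    x and y map a feature name to that feature's histogram. With
    weights=None every kernel gets weight one (plain summation); a dict
    of non-negative weights stands in for what an MKL solver would learn.
    """
    if weights is None:
        weights = {name: 1.0 for name in x}
    return sum(weights[name] * hik(x[name], y[name]) for name in x)


# Hypothetical two-feature videos and hand-picked weights.
video_x = {"hog3d": [0.5, 0.5], "color": [1.0, 0.0]}
video_y = {"hog3d": [0.25, 0.75], "color": [0.5, 0.5]}
k_sum = combined_kernel(video_x, video_y)
k_mkl = combined_kernel(video_x, video_y, {"hog3d": 1.5, "color": 0.5})
```

With unit weights the combination reduces to the kernel summation used
by KSVM and KLSVM; unequal weights let more informative features
contribute more to the similarity.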

Table 3 presents the results of these systems for DEV-O. A progression
in mAP performance is demonstrated as the different components of our
model are added. By allowing the model to learn the kernel weights for
the various feature modalities, MKL-SVM shows slight performance gains
over KSVM. KLSVM improves performance by incorporating our proposed
compositional model that performs latent segment selection. Finally,
the full model, MKL-KLSVM, which allows the various kernel weights to
be adapted for the global and segment components across multiple
features, attains the highest overall accuracy.

4.3.  Results  Visualizations 

Figure 3 shows qualitative results for our model on four test videos,
where eight-second segments are visualized using their center frames.
The frames that are latently selected tend to be discriminative and
ignore the temporal clutter inherent in many test videos. For example,
in the sewing project video, the latter frames where the individual is
walking in an outdoor environment are not selected because such scenes
are not typically associated with a video of a sewing project.

Latently  selected  frames  of  the  same  scene  type  model 
also  often  have  similar  overall  appearance  characteristics. 


Table 3: Performance comparison against several baselines using
multiple features on DEV-O for E6-E15. Numbers denote the average
precision, in %.

Event  KSVM   MKL-SVM  KLSVM  MKL-KLSVM
E6     6.36   6.77     5.36   6.24
E7     22.04  22.22    23.47  24.62
E8     31.23  31.40    31.99  37.46
E9     18.13  17.49    16.18  15.72
E10    2.48   2.55     2.36   2.09
E11    3.88   4.03     7.98   7.65
E12    10.90  11.00    10.77  12.01
E13    13.31  14.54    13.70  10.96
E14    12.97  12.34    31.22  32.67
E15    3.98   3.81     7.47   7.49
mAP    12.53  12.62    15.05  15.69

For instance, in the grooming animal test video, the frame in the green
box shows a view of a dog's backside with human hands moving its tail.
A support vector containing a frame for this scene type showing a
comparable view of a dog with extended human arms is also selected.

The visualizations also demonstrate the compositional approach. For
example, in the changing a tire test sequence, two of the top three
support vector videos offer good matches for three of the latently
selected frames in the test sequence (corresponding to the test frames
highlighted with red, yellow, and blue boxes). However, for the fourth
test frame that was selected (green box), only one of the top three
support vectors provides a particularly discriminative match. The
proposed model is able to accumulate evidence for classification from
different video segments in the pool of training videos.

5.  Conclusion 

We presented a compositional model for video event detection that
leverages a novel multiple kernel learning algorithm incorporating
structured latent variables. The kernelized latent variable framework
allows the model to select and match test video segments with those
that are extracted from the pool of training videos. The compositional
nature of the model allows it to respond to the challenges of
intra-class variation and temporal clutter, which
Figure 3: Qualitative visualization of results. Individual images
denote the center frame from an eight-second window. Each subfigure
shows frames from a testing video along with frames from the three
support vectors that produce the overall best match to that test video
(i.e., frames from only three support vector videos are shown for each
test sequence). For a test video, the K = 4 frames that were latently
selected are highlighted with colored boxes, where color denotes the
particular scene type model. Latently selected frames from the top
three support vectors are grouped using colored boxes, where color
corresponds to the same scene types selected for the test video. From
top to bottom, left to right, the testing videos correspond to changing
tire (E7), grooming animal (E10), repairing appliance (E14), and sewing
project (E15). Faces have been obscured for privacy considerations.
Best viewed magnified and in color.


are inherent in unconstrained internet videos. Additionally, since
multiple feature types are required to attain state-of-the-art
performance on TRECVID MED11, a principled approach to feature fusion
via multiple kernel learning with structured latent variables is
proposed. Experimental results showed that this approach outperforms
state-of-the-art baselines on the challenging TRECVID MED11 dataset.